In [2]:
# import useful Python packages

import sklearn                   # scikit-learn, a machine learning library
import numpy as np
import pandas as pd
import numpy.random as random    # NumPy's random module, used below for permutation

What is Supervised Learning?

0. Supervised vs. Unsupervised Learning

In supervised learning, the model describes how one set of observations, called inputs, affects another set of observations, called outputs. We are given a dataset containing both predictor variables and response variables, and we use it to learn a function that maps predictors to responses. We assume that there is some true mapping function $f$, and we try to construct an approximation $\hat f$ so that, given new data, we can make accurate predictions.

In unsupervised learning, we only have input data and no corresponding output variables. The goal is to model the underlying structure or distribution of the data in order to learn more about it. These methods are called unsupervised because, unlike in supervised learning, there are no correct answers and no teacher; algorithms are left on their own to discover interesting structure and patterns in the data.

1. Types of Variables: Predictors and Response

In a typical supervised learning problem, we have some number of predictor variables and a response (or outcome) variable. The outcome variable is the variable that we would like to predict. The predictor variables are the ones that we have at our disposal to try to determine the outcome variable. We usually label the number of predictor variables as $p$, and we call the $i$th instance of the $j$th predictor variable $X_{i,j}$. We call the $i$th instance of the response variable $Y_i$. Generally we write $Y = f(X) + \epsilon$, where $\epsilon$ represents the irreducible error in the relationship between $X$ and $Y$, which is impossible to predict.

Summary of Notation:

$\hat{Y_i}$ is the predicted outcome from our model.

$Y_i$ is the true outcome.

$X_{i,j}$ is the $i$th instance of the $j$th predictor variable.

We sometimes refer to the $i$th instance of the predictor variables collectively as a vector: $X_i$.
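
To make the notation concrete, here is a minimal sketch that simulates a dataset of the form $Y = f(X) + \epsilon$; the linear form of $f$ and the noise level are made up purely for illustration.

In [ ]:
# toy example: simulate n observations of p predictors and a noisy response
n, p = 100, 2                              # number of observations and predictors

X = random.uniform(0, 1, size=(n, p))      # X[i, j] is the ith instance of the jth predictor

def f(X):
    # a made-up "true" function f(X) = 2*X_1 - 3*X_2
    return 2 * X[:, 0] - 3 * X[:, 1]

epsilon = random.normal(0, 0.1, size=n)    # irreducible error, which we cannot predict
Y = f(X) + epsilon                         # Y[i] is the ith instance of the response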

2. Regression vs Classification

There are two types of supervised learning problems: regression and classification. In regression, the response variable is continuous and ordered. In classification, the response variable is discrete and unordered.

For example, in a regression problem, we might try to predict the price of a house given predictor variables such as its size and geographic location.

In a classification problem, we might try to predict whether someone has a disease or doesn't have a disease based on various health metrics.

3. Measuring Performance: Loss Functions

Once we have a model, we need some metric of performance to evaluate how good it is. We call this metric the loss function. Below are two typical loss functions used for regression problems.

Residual Sum of Squares: $RSS = \sum_{i=1}^n{(y_i - \hat{y_i})^2}$

Mean Squared Error: $MSE = \frac{1}{n} \sum_{i=1}^n{(y_i - \hat{y_i})^2}$
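
As a quick check of the formulas, here is a minimal sketch that computes both losses with NumPy; the y_true and y_pred arrays are made up purely for illustration.

In [ ]:
# toy example: compute RSS and MSE for some made-up predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])    # the true outcomes y_i
y_pred = np.array([1.1, 1.9, 3.2, 3.7])    # the predicted outcomes y_hat_i

residuals = y_true - y_pred
rss = np.sum(residuals ** 2)               # Residual Sum of Squares
mse = np.mean(residuals ** 2)              # Mean Squared Error = RSS / n
print(rss, mse)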

Explain why the Residual Sum of Squares makes sense as a way to evaluate performance for a regression problem.

4. Splitting the Data into Training and Test Subsets

One way to validate our model is to randomly split the dataset we're given into two subsets: training and test. This helps guard against overfitting, where our predictive model captures fluctuations due to the irreducible error ($\epsilon$) in the dataset rather than solely the function $f$ that relates $X$ (predictors) to $Y$ (response). Because of this, it's important to validate our model by testing it on data other than the training set.
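
For reference, scikit-learn provides a helper that performs this kind of random split in one call; a minimal sketch, assuming a predictor array X and response array y are already defined (the activity below performs the same split by hand with random.permutation).

In [ ]:
from sklearn.model_selection import train_test_split

# hold out half of the data as a random test set (X and y are assumed to exist)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)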

What are the disadvantages of this approach?

Another way to select a model is to use metrics (loss functions) that penalize flexibility in the model. We'll talk more about this later.

Activity:

Download the dataset (blank) and split it into two random subsets, training and test, using random.permutation.


In [12]:
dataset = pd.read_csv('dataset_3.txt')
dataset = dataset.values                              # convert the DataFrame to a NumPy array
shuffled = dataset[random.permutation(len(dataset))]  # put the rows in a random order
half = len(shuffled) // 2                             # integer division so the slice index is an int
train = shuffled[:half]
test  = shuffled[half:]

In [13]:
train


Out[13]:
array([[ 0.78872 ,  0.12372 ],
       [ 0.30162 ,  0.92028 ],
       [ 0.98219 ,  0.36954 ],
       [ 0.52378 ,  0.63821 ],
       [ 0.60952 ,  0.31562 ],
       [ 0.25461 ,  1.2268  ],
       [ 0.2193  ,  1.1652  ],
       [ 0.93116 ,  0.31258 ],
       [ 0.40037 ,  0.79861 ],
       [ 0.20943 ,  0.87333 ],
       [ 0.6277  ,  0.14077 ],
       [ 0.53018 ,  0.4121  ],
       [ 0.59201 ,  0.39336 ],
       [ 0.95933 ,  0.46815 ],
       [ 0.48321 ,  0.58507 ],
       [ 0.70083 ,  0.269   ],
       [ 0.75707 ,  0.030316],
       [ 0.38015 ,  0.82334 ],
       [ 0.98018 ,  0.25187 ],
       [ 0.27079 ,  0.81714 ],
       [ 0.79875 ,  0.17337 ],
       [ 0.9827  ,  0.27228 ],
       [ 0.029917,  0.78036 ],
       [ 0.69138 ,  0.19235 ],
       [ 0.80825 ,  0.27585 ],
       [ 0.64675 ,  0.24033 ],
       [ 0.28556 ,  0.94911 ],
       [ 0.55796 ,  0.45458 ],
       [ 0.90728 ,  0.089873],
       [ 0.14731 ,  1.0997  ],
       [ 0.51904 ,  0.6184  ],
       [ 0.5812  ,  0.37174 ],
       [ 0.78268 ,  0.21527 ],
       [ 0.016353,  0.91325 ],
       [ 0.47996 ,  0.77356 ],
       [ 0.58625 ,  0.4279  ],
       [ 0.32063 ,  0.97855 ],
       [ 0.24326 ,  1.0872  ],
       [ 0.89082 ,  0.13205 ],
       [ 0.8477  ,  0.079209],
       [ 0.65013 ,  0.25514 ],
       [ 0.87619 ,  0.30402 ],
       [ 0.715   ,  0.20274 ],
       [ 0.21487 ,  1.0851  ],
       [ 0.57685 ,  0.28319 ],
       [ 0.72453 ,  0.30067 ],
       [ 0.016186,  0.80712 ],
       [ 0.13795 ,  0.84285 ],
       [ 0.57254 ,  0.46251 ],
       [ 0.012743,  0.69141 ]])

In [14]:
test


Out[14]:
array([[ 0.24261 ,  1.0232  ],
       [ 0.90133 ,  0.21189 ],
       [ 0.65082 ,  0.11862 ],
       [ 0.832   ,  0.24985 ],
       [ 0.24265 ,  0.98807 ],
       [ 0.22379 ,  1.0543  ],
       [ 0.76157 ,  0.28938 ],
       [ 0.1428  ,  1.0873  ],
       [ 0.60749 ,  0.59353 ],
       [ 0.13412 ,  1.2569  ],
       [ 0.76252 ,  0.091609],
       [ 0.25648 ,  0.93484 ],
       [ 0.69451 ,  0.19398 ],
       [ 0.1575  ,  1.0454  ],
       [ 0.45317 ,  0.69052 ],
       [ 0.32607 ,  0.85856 ],
       [ 0.79801 , -0.008473],
       [ 0.18416 ,  1.1308  ],
       [ 0.16392 ,  1.2247  ],
       [ 0.46635 ,  0.50976 ],
       [ 0.03887 ,  0.79283 ],
       [ 0.69888 ,  0.29482 ],
       [ 0.52958 ,  0.60364 ],
       [ 0.50647 ,  0.69627 ],
       [ 0.5995  ,  0.28685 ],
       [ 0.058972,  0.94799 ],
       [ 0.68849 ,  0.1395  ],
       [ 0.38261 ,  0.75929 ],
       [ 0.63028 ,  0.31856 ],
       [ 0.58898 ,  0.35595 ],
       [ 0.57338 ,  0.42143 ],
       [ 0.23944 ,  1.02    ],
       [ 0.4773  ,  0.76368 ],
       [ 0.97694 ,  0.48927 ],
       [ 0.91329 ,  0.10798 ],
       [ 0.33226 ,  0.74293 ],
       [ 0.91942 ,  0.2101  ],
       [ 0.19446 ,  0.92892 ],
       [ 0.4773  ,  0.61643 ],
       [ 0.18921 ,  0.95759 ],
       [ 0.72757 ,  0.36886 ],
       [ 0.97652 ,  0.23641 ],
       [ 0.040263,  0.93099 ],
       [ 0.48976 ,  0.64098 ],
       [ 0.72363 ,  0.074964],
       [ 0.42039 ,  0.66792 ],
       [ 0.62843 ,  0.30405 ],
       [ 0.92423 ,  0.22167 ],
       [ 0.1087  ,  1.0019  ],
       [ 0.14056 ,  1.0594  ]])
